Synonyms

Bias-variance trade-offs, bias plus variance.

Definition



Consider a given random variable $\mathbf{F}$ and a random variable that we can modify, $\hat{\mathbf{F}}$. We wish to use a sample of $\hat{\mathbf{F}}$ as an estimate of a sample of $\mathbf{F}$. The mean squared error between such a pair of samples is a sum of four terms. The first term reflects the statistical coupling between $\mathbf{F}$ and $\hat{\mathbf{F}}$ and is conventionally ignored in bias-variance analysis. The second term reflects the inherent noise in $\mathbf{F}$ and is independent of the estimator $\hat{\mathbf{F}}$. Accordingly, we cannot affect this term. In contrast, the third and fourth terms depend on $\hat{\mathbf{F}}$. The third term, called the bias, is independent of the precise samples of both $\mathbf{F}$ and $\hat{\mathbf{F}}$, and reflects the difference between the means of $\mathbf{F}$ and $\hat{\mathbf{F}}$. The fourth term, called the variance, is independent of the precise sample of $\mathbf{F}$, and reflects the inherent noise in the estimator as one samples it. These last two terms can be modified by changing the choice of the estimator. In particular, on small sample sets, we can often decrease our mean squared error by, for instance, introducing a small bias that causes a large reduction in the variance. While such trade-offs are most commonly used in machine learning, this article shows that bias-variance trade-offs are applicable in a much broader context and in a variety of situations. We also show, using experiments, how existing bias-variance trade-offs can be applied in novel circumstances to improve the performance of a class of optimization algorithms.

Motivation and Background

In its simplest form, the bias-variance decomposition is based on the following idea. Say we have a Euclidean random variable $\mathbf{F}$ taking on values $F$ distributed according to a density function $p(F)$. We want to estimate what value we would get if we were to sample $p(F)$. However, we do not (or cannot) do this simply by sampling $\mathbf{F}$ directly. Instead, to form our estimate, we sample a different Euclidean random variable $\hat{\mathbf{F}}$ taking on values $\hat{F}$ distributed according to $p(\hat{F})$. Assuming a quadratic loss function, the quality of our estimate is measured by its Mean Squared Error (MSE):

$$\mathrm{MSE}(\hat{\mathbf{F}}) = \int p(\hat{F}, F)\,(\hat{F} - F)^2 \, d\hat{F}\, dF. \tag{1}$$



Example 1: To illustrate Eq. 1, consider the simplest type of supervised machine learning problem, where there is a finite input space $X$, the output space $Y$ is the real numbers, and there is no noise. In such learning there is some deterministic 'target function' $f$ that maps each element of $X$ to a single element of $Y$. There is a 'prior' probability density function $\pi(f)$ over target functions, and it gets sampled to produce some particular target function, $f$. Next, $f$ is IID sampled at a set of $m$ inputs to produce a 'training set' $\mathcal{D}$ of input-output pairs.

For simplicity, say there is some single fixed 'prediction point' $x \in X$. Our goal in supervised learning is to estimate $f(x)$. However, $f$ is not known to us. Accordingly, to perform the estimation the training set is presented to a 'learning algorithm', which in response to the training set produces a guess $g(x)$ for the value $f(x)$.

This entire stochastic procedure defines a joint distribution $\pi(f, \mathcal{D}, f(x), g(x))$. We can marginalize it to get a distribution $\pi(f(x), g(x))$. Since $g(x)$ is supposed to be an estimate of $f(x)$, we can identify $g(x)$ as the value $\hat{F}$ of the random variable $\hat{\mathbf{F}}$ and $f(x)$ as the value $F$ of $\mathbf{F}$. In other words, we can define $p(F, \hat{F}) = \pi(f(x), g(x))$. If we now ask what the mean squared error is of the guess made by our learning algorithm for the value $f(x)$, we get Eq. 1.

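Note that one would expect that this $\mathbf{F}$ and $\hat{\mathbf{F}}$ are statistically dependent. (Indeed, if they weren't dependent, then the dependence of the learning algorithm on $\mathcal{D}$ would be pointless.) Formally, the dependence can be established by writing

$$\begin{aligned}
p(f(x), g(x)) &= \int d\mathcal{D}\; p(f(x), g(x) \mid \mathcal{D})\, p(\mathcal{D}) \\
&= \int d\mathcal{D}\; p(g(x) \mid f(x), \mathcal{D})\, p(f(x) \mid \mathcal{D})\, p(\mathcal{D}) \\
&= \int d\mathcal{D}\; p(g(x) \mid \mathcal{D})\, p(f(x) \mid \mathcal{D})\, p(\mathcal{D})
\end{aligned}$$

(since the guess of the learning algorithm is determined in full by the training set), and then noting that in general this integral differs from the product

$$p(f(x))\, p(g(x)) = \left[\int d\mathcal{D}\; p(f(x) \mid \mathcal{D})\, p(\mathcal{D})\right] \left[\int d\mathcal{D}\; p(g(x) \mid \mathcal{D})\, p(\mathcal{D})\right].$$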



In Ex. 1, $\mathbf{F}$ and $\hat{\mathbf{F}}$ are statistically coupled. Such coupling is extremely common. In practice, though, such coupling is simply ignored in analyses of bias plus variance, without any justification. In particular, Bayesian supervised learning avoids any explicit consideration of bias plus variance. For its part, non-Bayesian supervised learning avoids consideration of the coupling by replacing the distribution $p(F, \hat{F})$ with the associated product of marginals, $p(F)\,p(\hat{F})$. For now we follow that latter practice. So our equation for MSE reduces to

$$\mathrm{MSE}(\hat{\mathbf{F}}) = \int p(\hat{F})\, p(F)\, (\hat{F} - F)^2 \, d\hat{F}\, dF. \tag{2}$$

(If we were to account for the coupling of $\mathbf{F}$ and $\hat{\mathbf{F}}$, an additive correction term would need to be added to the right-hand side. For instance, see Wolpert [1997].)



Using simple algebra, the right-hand side of Eq. 2 can be written as the sum of three terms. The first is the variance of $\mathbf{F}$. Since this is beyond our control in designing the estimator $\hat{\mathbf{F}}$, we ignore it for the rest of this article. The second term involves a mean that describes the deterministic component of the error. This term depends on both the distribution of $\mathbf{F}$ and that of $\hat{\mathbf{F}}$, and quantifies how close the means of those distributions are. The third term is a variance that describes stochastic variations from one sample to the next. This term is independent of the random variable being estimated. Formally, up to an overall additive constant, we can write

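$$\begin{aligned}
\mathrm{MSE}(\hat{\mathbf{F}}) &= \int p(\hat{F})\,(\hat{F}^2 - 2F\hat{F} + F^2)\, d\hat{F} \\
&= \int p(\hat{F})\,\hat{F}^2\, d\hat{F} \;-\; 2F \int p(\hat{F})\,\hat{F}\, d\hat{F} \;+\; F^2 \\
&= \mathbb{V}(\hat{F}) + [\mathbb{E}(\hat{F})]^2 - 2F\,\mathbb{E}(\hat{F}) + F^2 \\
&= \mathbb{V}(\hat{F}) + [F - \mathbb{E}(\hat{F})]^2 \\
&= \text{variance} + \text{bias}^2.
\end{aligned} \tag{3}$$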
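The identity in Eq. 3 can be checked numerically. The following is a minimal sketch, not taken from the original article: it assumes a fixed target value, Gaussian samples with unit variance, and a hypothetical "shrunken sample mean" estimator (`shrink` times the mean of `n` samples). All names and parameter values are illustrative assumptions.

```python
import random
import statistics


def mse_decomposition(true_value, shrink, n, trials, seed=0):
    """Empirical MSE, variance and squared bias of a shrunken sample mean.

    The estimator (an assumption for illustration) is shrink * mean(sample),
    where the sample holds n Gaussian draws centred on true_value.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        sample = [rng.gauss(true_value, 1.0) for _ in range(n)]
        estimates.append(shrink * statistics.fmean(sample))
    mean_est = statistics.fmean(estimates)
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    variance = statistics.fmean([(e - mean_est) ** 2 for e in estimates])
    bias_sq = (mean_est - true_value) ** 2
    return mse, variance, bias_sq


if __name__ == "__main__":
    for shrink in (1.0, 0.8):
        mse, var, bias_sq = mse_decomposition(0.5, shrink, n=5, trials=2000)
        print(f"shrink={shrink}: mse={mse:.4f} variance={var:.4f} bias^2={bias_sq:.4f}")
```

With these assumed settings the shrunken estimator is biased, yet its variance drops by enough that its MSE is lower than that of the unbiased sample mean, which is the kind of trade-off the text describes for small sample sets.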



In light of Eq. 3, one way to try to reduce expected quadratic error is to modify an estimator to trade off bias and variance. Some of the most famous applications of such bias-variance trade-offs occur in parametric machine learning, where many techniques have been developed to exploit the trade-off. However, there are some extensions of that trade-off that could be applied in parametric machine learning that have been ignored by the community. We illustrate one of them here.

Moreover, the bias-variance trade-off arises in many other fields besides parametric machine learning. In particular, as we illustrate here, it arises in integral estimation and optimization. In the rest of this paper we present some novel applications of the bias-variance trade-off, and describe some interesting features in each case. A recurring theme is that whenever a bias-variance trade-off arises in a particular field, we can use many techniques from parametric machine learning that have been developed for exploiting this trade-off. The novel applications of the trade-off discussed here are instances of Probability Collectives (PC) [Wolpert and Rajnarayan 2007, Wolpert et al. 2006, Wolpert and Bieniawski 2004a,b, Macready and Wolpert 2005], a general approach to using probability distributions to do blackbox optimization.



Applications 

In this section, we describe some applications of the bias-variance trade-off. First, we describe Monte Carlo (MC) techniques for the estimation of integrals, and provide a brief analysis of bias-variance trade-offs in this context. Next, we introduce the field of Monte Carlo Optimization (MCO), and illustrate that there are more subtleties involved than in simple MC. Then, we describe the field of Parametric Machine Learning, which, as we will show, is formally identical to MCO. Finally, we present an application of Parametric Learning (PL) techniques to improve the performance of MCO algorithms. We do this in the context of an MCO problem that is central to how PC addresses blackbox optimization.



Monte Carlo Estimation of Integrals Using Importance Sampling 

Monte Carlo methods are often the method of choice for estimating difficult high-dimensional integrals. Consider a function $f: X \to \mathbb{R}$, which we want to integrate over some region $\mathcal{X} \subseteq X$, yielding the value $F$, as given by

$$F = \int_{\mathcal{X}} dx\, f(x).$$

We can view this as a random variable $\mathbf{F}$, with density function given by a Dirac delta function centered on $F$. Therefore, the variance of $\mathbf{F}$ is 0, and Eq. 3 is exact.



A popular MC method to estimate this integral is importance sampling [see Robert and Casella 2004]. This exploits the law of large numbers as follows: i.i.d. samples $x^{(i)}$, $i = 1, \ldots, m$, are generated from a so-called importance distribution $h(x)$ that we control, and the associated values of the integrand, $f(x^{(i)})$, are computed. Denote these 'data' by

$$\mathcal{D} = \{(x^{(i)}, f(x^{(i)})),\; i = 1, \ldots, m\}. \tag{4}$$

Now,

$$F = \int_{\mathcal{X}} dx\, h(x)\, \frac{f(x)}{h(x)} = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} \frac{f(x^{(i)})}{h(x^{(i)})} \quad \text{with probability 1.}$$



Denote by $\hat{\mathbf{F}}$ the random variable with value given by the sample average for $\mathcal{D}$:

$$\hat{F} = \frac{1}{m} \sum_{i=1}^{m} \frac{f(x^{(i)})}{h(x^{(i)})}.$$

We use $\hat{\mathbf{F}}$ as our statistical estimator for $\mathbf{F}$, as we broadly described in the introductory section. Assuming a quadratic loss function, $L(\hat{F}, F) = (F - \hat{F})^2$, the bias-variance decomposition described in Eq. 3 applies exactly. It can be shown that the estimator $\hat{\mathbf{F}}$ is unbiased, that is, $\mathbb{E}\,\hat{\mathbf{F}} = F$, where the mean is over samples of $h$. Consequently, the MSE of this estimator is just its variance. The choice of sampling distribution $h$ that minimizes this variance is given by [see Robert and Casella 2004]

$$h^\star(x) = \frac{|f(x)|}{\int_{\mathcal{X}} |f(x')|\, dx'}.$$

By itself, this result is not very helpful, since the equation for the optimal importance distribution contains a similar integral to the one we are trying to estimate. For non-negative integrands $f(x)$, the VEGAS algorithm [Lepage 1978] describes an adaptive method to find successively better importance distributions, by iteratively estimating $\mathbf{F}$, and then using that estimate to generate the next importance distribution $h$. In the case of these unbiased estimators, there is no trade-off between bias and variance, and minimizing MSE is achieved by minimizing variance.
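The importance-sampling estimator and the effect of the choice of $h$ can be illustrated with a short sketch. The integrand $f(x) = x^2$ on $[0, 1]$ (exact integral $1/3$), the two importance distributions, and all function names below are assumptions chosen for illustration; they do not come from the original article.

```python
import random


def importance_estimate(f, sample_h, h, m, rng):
    """Average of f(x)/h(x) over m i.i.d. draws x ~ h."""
    total = 0.0
    for _ in range(m):
        x = sample_h(rng)
        total += f(x) / h(x)
    return total / m


def f(x):
    return x * x  # illustrative integrand on [0, 1]; exact integral is 1/3


def sample_uniform(rng):
    return 1.0 - rng.random()  # uniform draw in (0, 1], avoids x == 0


def h_uniform(x):
    return 1.0


def sample_optimal(rng):
    # Inverse-CDF draw from h*(x) = |f(x)| / integral |f| = 3 x^2 on (0, 1].
    return (1.0 - rng.random()) ** (1.0 / 3.0)


def h_optimal(x):
    return 3.0 * x * x


if __name__ == "__main__":
    rng = random.Random(1)
    print("uniform h:", importance_estimate(f, sample_uniform, h_uniform, 20000, rng))
    print("optimal h:", importance_estimate(f, sample_optimal, h_optimal, 100, rng))
```

Under the variance-minimizing $h^\star$, every term $f(x)/h^\star(x)$ equals the integral itself, so the estimator has zero variance. As the text notes, this is not directly usable in practice because constructing $h^\star$ requires the integral being estimated.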

Monte Carlo Optimization 

Instead of a fixed integral to evaluate, consider a parametrized integral

$$F(\theta) = \int_{\mathcal{X}} dx\, f_\theta(x).$$

Further, suppose we are interested in finding the value of the parameter $\theta \in \Theta$ that minimizes $F(\theta)$:

$$\theta^\star = \arg\min_{\theta \in \Theta} F(\theta).$$

In the case where the functional form of $f_\theta$ is not explicitly known, one approach to solve this problem is a technique called Monte Carlo Optimization (MCO) [see Ermoliev and Norkin 1998], involving repeated MC estimation of the integral in question with adaptive modification of the parameter $\theta$.

We proceed by analogy to the case with MC. First, we introduce the $\theta$-indexed random variable $\mathbf{F}(\theta)$, all of whose components have delta-function distributions about the associated values $F(\theta)$. Next, we introduce a $\theta$-indexed vector random variable $\hat{\mathbf{F}}$ with values

$$\hat{F} = \{\hat{F}(\theta)\;\; \forall\, \theta \in \Theta\}. \tag{5}$$

Each real-valued component $\hat{\mathbf{F}}(\theta)$ can be sampled and viewed as an estimate of $F(\theta)$.

For example, let $\mathcal{D}$ be a data set as described in Eq. 4. Then for every $\theta$, any sample of $\mathcal{D}$ provides an associated estimate

$$\hat{F}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{f_\theta(x^{(i)})}{h(x^{(i)})}. \tag{6}$$

That average serves as an estimate of $F(\theta)$. Formally, $\hat{\mathbf{F}}$ is a function of the random variable $\mathcal{D}$, and is given by such averaging over the elements of $\mathcal{D}$. So, a sample of $\mathcal{D}$ provides a sample of $\hat{\mathbf{F}}$. A priori, we make no restrictions on $\hat{\mathbf{F}}$, and so, in general, its components may be statistically coupled with one another. Note that this coupling arises even though we are, for simplicity, treating each function $F(\theta)$ as having a delta-function distribution, rather than as having a non-zero variance that would reflect our lack of knowledge of the $f_\theta$ functions.

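However $\hat{\mathbf{F}}$ is defined, given a sample of $\hat{\mathbf{F}}$, one way to estimate $\theta^\star$ is

$$\hat{\theta}^\star = \arg\min_{\theta \in \Theta} \hat{F}(\theta).$$

We call this approach 'natural' MCO. As an example, say that $\mathcal{D}$ is a set of $m$ samples of $h$, and let

$$\hat{F}(\theta) \triangleq \frac{1}{m} \sum_{i=1}^{m} \frac{f_\theta(x^{(i)})}{h(x^{(i)})},$$

as above. Under this choice for $\hat{\mathbf{F}}$,

$$\hat{\theta}^\star = \arg\min_{\theta \in \Theta} \frac{1}{m} \sum_{i=1}^{m} \frac{f_\theta(x^{(i)})}{h(x^{(i)})}.$$

We call this approach 'naive' MCO.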

Consider any algorithm that estimates $\theta^\star$ as a single-valued function of $\hat{\mathbf{F}}$. The estimate of $\theta^\star$ produced by that algorithm is itself a random variable, since it is a function of the random variable $\hat{\mathbf{F}}$. Call this random variable $\hat{\boldsymbol{\theta}}^\star$, taking on values $\hat{\theta}^\star$. Any MCO algorithm is defined by $\hat{\boldsymbol{\theta}}^\star$; that random variable encapsulates the output estimate made by the algorithm.

To analyze the error of such an algorithm, consider the associated random variable given by the true parametrized integral $F(\hat{\boldsymbol{\theta}}^\star)$. The difference between a sample of $F(\hat{\boldsymbol{\theta}}^\star)$ and the true minimal value of the integral, $F(\theta^\star) = \min_\theta F(\theta)$, is the error introduced by our estimating that optimal $\theta$ as a sample of $\hat{\boldsymbol{\theta}}^\star$. Since our aim in MCO is to minimize $F(\theta)$, we adopt the loss function $L(\hat{\theta}^\star, \theta^\star) = F(\hat{\theta}^\star) - F(\theta^\star)$. This is in contrast to our discussion on MC integration, which involved quadratic loss. The current loss function just equals $F(\hat{\theta}^\star)$ up to an additive constant $F(\theta^\star)$ that is fixed by the MCO problem at hand and is beyond our control. Up to that additive constant, the associated expected loss is

$$\mathbb{E}(L) = \int d\hat{\theta}^\star\, p(\hat{\theta}^\star)\, F(\hat{\theta}^\star). \tag{7}$$

Now change coordinates in this integral from the values of the scalar random variable $\hat{\boldsymbol{\theta}}^\star$ to the values of the underlying vector random variable $\hat{\mathbf{F}}$. The expected loss now becomes

$$\mathbb{E}(L) = \int d\hat{F}\, p(\hat{F})\, F(\hat{\theta}^\star(\hat{F})).$$

The natural MCO algorithm provides some insight into these results. For that algorithm,

$$\begin{aligned}
\mathbb{E}(L) &= \int d\hat{F}\, p(\hat{F})\, F\big(\arg\min_\theta \hat{F}(\theta)\big) \\
&= \int d\hat{F}(\theta_1)\, d\hat{F}(\theta_2) \cdots\; p(\hat{F}(\theta_1), \hat{F}(\theta_2), \ldots)\, F\big(\arg\min_\theta \hat{F}(\theta)\big).
\end{aligned} \tag{8}$$

For any fixed $\theta$, there is an error between samples of $\hat{\mathbf{F}}(\theta)$ and the true value $F(\theta)$. Bias-variance considerations apply to this error, exactly as in the discussion of MC above. We are not, however, concerned with $\hat{\mathbf{F}}$ for a single component $\theta$, but rather for a set $\Theta$ of $\theta$'s.

The simplest such case is where the components of $\hat{\mathbf{F}}(\Theta)$ are independent. Even so, $\arg\min_\theta \hat{F}(\theta)$ is distributed according to the laws for extrema of multiple independent random variables, and this distribution depends on higher-order moments of each random variable $\hat{\mathbf{F}}(\theta)$. This means that $\mathbb{E}[L]$ also depends on such higher-order moments. Only the first two moments, however, arise in the bias and variance for any single $\theta$. Thus, even in the simplest possible case, the bias-variance considerations for the individual $\theta$ do not provide a complete analysis.
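A small simulation can make the role of extrema concrete. The sketch below is an illustration under assumed settings, not the article's experiment: each $\hat{F}(\theta)$ is taken to be the true $F(\theta)$ plus independent zero-mean Gaussian noise, so every component is individually unbiased, and natural MCO picks the $\theta$ with the smallest estimate. The function names and the grid of true values are hypothetical.

```python
import random


def natural_mco_trial(true_values, noise_sd, rng):
    """One natural-MCO trial with independent, unbiased noisy estimates."""
    estimates = [F + rng.gauss(0.0, noise_sd) for F in true_values]
    best = min(range(len(estimates)), key=estimates.__getitem__)
    # Return the minimal estimate and the true value at the chosen theta.
    return estimates[best], true_values[best]


def simulate(true_values, noise_sd, trials, seed=0):
    """Average minimal estimate and average loss F(theta_hat) - F(theta*)."""
    rng = random.Random(seed)
    true_min = min(true_values)
    min_est_sum = 0.0
    loss_sum = 0.0
    for _ in range(trials):
        min_est, picked_true = natural_mco_trial(true_values, noise_sd, rng)
        min_est_sum += min_est
        loss_sum += picked_true - true_min
    return min_est_sum / trials, loss_sum / trials


if __name__ == "__main__":
    true_values = [0.1 * i for i in range(10)]  # assumed F(theta) on a grid
    mean_min_est, mean_loss = simulate(true_values, noise_sd=1.0, trials=2000)
    print("true minimum:", min(true_values))
    print("mean of min estimate:", mean_min_est)
    print("mean loss:", mean_loss)
```

Although every individual estimate is unbiased here, the minimum over the estimates is biased low and the expected loss is strictly positive, which is one way to see that per-$\theta$ bias and variance alone do not determine $\mathbb{E}[L]$.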

In most cases, the components of $\hat{\mathbf{F}}$ are not independent. Therefore, in order to analyze $\mathbb{E}[L]$, in addition to higher moments of the distribution for each $\theta$, we must now also consider higher-order moments coupling the estimates $\hat{\mathbf{F}}(\theta)$ for different $\theta$.

Due to these effects, it may be quite acceptable for all the components $\hat{\mathbf{F}}(\theta)$ to have both a large bias and a large variance, as long as they still order the $\theta$'s correctly with respect to the true $F(\theta)$. In such a situation, large covariances could ensure that if some $\hat{\mathbf{F}}(\theta)$ were incorrectly large, then $\hat{\mathbf{F}}(\theta')$, $\theta' \neq \theta$, would also be incorrectly large. This coupling between the components of $\hat{\mathbf{F}}$ would preserve the ordering of $\theta$'s under $F$. So, even with large bias and variance for each $\theta$, the estimator as a whole would still work well.
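The effect of coupling can be sketched with a toy comparison. In the assumed setup below (not from the article), every estimate carries the same constant bias and a large-variance noise term; in the coupled case the noise is shared across all $\theta$, while in the independent case each $\theta$ gets its own draw. The names `mean_loss` and `pick_argmin` and all numerical settings are illustrative.

```python
import random


def pick_argmin(values):
    """Index of the smallest entry."""
    return min(range(len(values)), key=values.__getitem__)


def mean_loss(true_values, trials, coupled, bias=3.0, noise_sd=5.0, seed=0):
    """Average loss F(theta_hat) - F(theta*) of natural MCO.

    coupled=True: one noise draw is shared by all components (perfect covariance).
    coupled=False: each component gets an independent noise draw.
    """
    rng = random.Random(seed)
    best_true = min(true_values)
    total = 0.0
    for _ in range(trials):
        if coupled:
            shared = rng.gauss(0.0, noise_sd)
            estimates = [F + bias + shared for F in true_values]
        else:
            estimates = [F + bias + rng.gauss(0.0, noise_sd) for F in true_values]
        total += true_values[pick_argmin(estimates)] - best_true
    return total / trials


if __name__ == "__main__":
    thetas = [i / 10 for i in range(11)]
    true_values = [(t - 0.3) ** 2 for t in thetas]  # assumed F(theta)
    print("coupled noise, mean loss:    ", mean_loss(true_values, 2000, coupled=True))
    print("independent noise, mean loss:", mean_loss(true_values, 2000, coupled=False))
```

Per-component bias and variance are identical in the two cases, yet the shared-noise estimator always preserves the ordering of the $\theta$'s and incurs no loss, whereas the independent-noise estimator does not.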

Nevertheless, it is sufficient to design estimators $\hat{\mathbf{F}}(\theta)$ with sufficiently small bias plus variance for each single $\theta$. More precisely, suppose that those terms are very small on the scale of differences $F(\theta) - F(\theta')$ for any $\theta$ and $\theta'$. Then by Chebyshev's inequality, we know that the density functions of the random variables $\hat{\mathbf{F}}(\theta)$ and $\hat{\mathbf{F}}(\theta')$ have almost no overlap. Accordingly, the probability that a sample of $\hat{\mathbf{F}}(\theta) - \hat{\mathbf{F}}(\theta')$ has the opposite sign of $F(\theta) - F(\theta')$ is almost zero.

Evidently, $\mathbb{E}[L]$ is generally determined by a complicated relationship involving bias, variance, covariance, and higher moments. Natural MCO in general, and naive MCO in particular, ignore all of these effects, and consequently, often perform quite poorly in practice. In the next section we discuss some ways of addressing this problem.



Parametric Machine Learning

There are many versions of the basic MCO problem described in the previous section. Some of the best-explored arise in parametric density estimation and parametric supervised learning, which together comprise the field of Parametric Machine Learning (PL).

In particular, parametric supervised learning attempts to solve

$$\arg\min_{\theta \in \Theta} \int dx\, p(x) \int dy\, p(y \mid x)\, f_\theta(x).$$

Here, the values $x$ represent inputs, and the values $y$ represent corresponding outputs, generated according to some stochastic process defined by a set of conditional distributions $\{p(y \mid x),\, x \in X\}$. Typically, one tries to solve this problem by casting it as an MCO problem. For instance, say we adopt a quadratic loss between a predictor $z_\theta(x)$ and the true value of $y$. Using MCO notation, we can express the associated supervised learning problem as finding $\arg\min_\theta F(\theta)$, where

$$\begin{aligned}
l_\theta(x) &= \int dy\, p(y \mid x)\, (z_\theta(x) - y)^2, \\
f_\theta(x) &= p(x)\, l_\theta(x), \\
F(\theta) &= \int dx\, f_\theta(x).
\end{aligned} \tag{9}$$

Next, the argmin is estimated by minimizing a sample-based estimate of the $F(\theta)$'s. More precisely, we are given a 'training set' of samples of $p(y \mid x)\,p(x)$, $\{(x^{(i)}, y^{(i)}),\; i = 1, \ldots, m\}$. This training set provides a set of associated estimates of $F(\theta)$:

$$\hat{F}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \big(z_\theta(x^{(i)}) - y^{(i)}\big)^2.$$

These are used to estimate $\arg\min_\theta F(\theta)$ exactly as in MCO. In particular, one could estimate the minimizer of $F(\theta)$ by finding the minimum of $\hat{F}(\theta)$, just as in natural MCO. As mentioned above, this MCO algorithm can perform very poorly in practice. In PL, this poor performance is called 'overfitting the data'.

There are several formal approaches that have been explored in PL to try to address this 'overfitting the data'. Interestingly, none are based on direct consideration of the random variable $F(\hat{\theta}^\star(\hat{\mathbf{F}}))$ and the ramifications of its distribution for expected loss (cf. Eq. 8). In particular, no work has applied the mathematics of extrema of multiple random variables to analyze the bias-variance-covariance trade-offs encapsulated in Eq. 8.

The PL approach that perhaps comes closest to such direct consideration of the distribution of $F(\hat{\boldsymbol{\theta}}^\star)$ is uniform convergence theory, which is a central part of Computational Learning Theory [see Angluin 1992]. Uniform convergence theory starts by crudely encapsulating the quadratic loss formula for expected loss under natural MCO, Eq. 8. It does this by considering the worst-case bound, over possible $p(x)$ and $p(y \mid x)$, of the probability that $F(\hat{\theta}^\star)$ exceeds $\min_\theta F(\theta)$ by more than $\kappa$. It then examines how that bound varies with $\kappa$. In particular, it relates such variation to characteristics of the set of functions $\{f_\theta : \theta \in \Theta\}$, e.g., the 'VC dimension' of that set [see Vapnik 1982, 1995].

Another, historically earlier approach is to apply bias-plus-variance considerations to the entire PL algorithm $\hat{\boldsymbol{\theta}}^\star$, rather than to each $\hat{\mathbf{F}}(\theta)$ separately. This approach is applicable for algorithms that do not use natural MCO, and even for non-parametric supervised learning. As formulated for parametric supervised learning, this approach combines the formulas in Eq. 9 to write

$$F(\theta) = \int dx\, dy\; p(x)\, p(y \mid x)\, (z_\theta(x) - y)^2.$$

This is then substituted into Eq. 7, giving

$$\begin{aligned}
\mathbb{E}[L] &= \int d\hat{\theta}^\star\, dx\, dy\; p(x)\, p(y \mid x)\, p(\hat{\theta}^\star)\, (z_{\hat{\theta}^\star}(x) - y)^2 \\
&= \int dx\; p(x) \left[ \int d\hat{\theta}^\star\, dy\; p(y \mid x)\, p(\hat{\theta}^\star)\, (z_{\hat{\theta}^\star}(x) - y)^2 \right].
\end{aligned} \tag{10}$$

The term in square brackets is an $x$-parameterized expected quadratic loss, which can be decomposed into a bias, variance, etc., in the usual way. This formulation eliminates any direct concern for issues like the distribution of extrema of multiple random variables, covariances between $\hat{\mathbf{F}}(\theta)$ and $\hat{\mathbf{F}}(\theta')$ for different values of $\theta$, and so on.
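As a worked illustration of that decomposition (a sketch, with shorthand introduced here rather than taken from the article), fix $x$ and write $z = z_{\hat{\theta}^\star}(x)$, $\bar{z} = \mathbb{E}_{\hat{\theta}^\star}[z]$ and $\bar{y} = \mathbb{E}[y \mid x]$. Because the bracketed term in Eq. 10 uses the product $p(y \mid x)\, p(\hat{\theta}^\star)$, $z$ and $y$ are treated as independent, so the cross terms of $(z - \bar{z}) + (\bar{z} - \bar{y}) + (\bar{y} - y)$ vanish when the square is averaged:

```latex
\begin{aligned}
\int d\hat{\theta}^\star\, dy\; p(y \mid x)\, p(\hat{\theta}^\star)\, (z - y)^2
  &= \mathbb{E}\big[(z - \bar{z})^2\big]
   + (\bar{z} - \bar{y})^2
   + \mathbb{E}\big[(\bar{y} - y)^2 \mid x\big] \\
  &= \underbrace{\mathbb{V}(z)}_{\text{variance}}
   + \underbrace{(\bar{z} - \bar{y})^2}_{\text{bias}^2}
   + \underbrace{\mathbb{V}(y \mid x)}_{\text{noise}} .
\end{aligned}
```

The noise term does not depend on the learning algorithm, which matches the role played by the variance of $\mathbf{F}$ in the discussion of Eq. 2 and Eq. 3.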

There are numerous other approaches for addressing the problems of natural MCO that have been explored in PL. Particularly important among these are Bayesian approaches, e.g., [Buntine and Weigend 1991, Berger 1985, Mackay 2003]. Based on these approaches, as well as on intuition, many powerful techniques for addressing data-overfitting have been explored in PL, including regularization, cross-validation, stacking, bagging, etc. Essentially all of these techniques can be applied to any MCO problem, not just PL problems. Since many of these techniques can be justified using Eq. 10, they provide a way to exploit the bias-variance trade-off in other domains besides PL.



PLMCO 

In this section, we illustrate how PL techniques that exploit the bias-variance decomposition of Eq. 10 can be used to improve an MCO algorithm used in a domain outside of PL. This MCO algorithm is a version of adaptive importance sampling, somewhat similar to the CE method [Rubinstein and Kroese 2004], and is related to function smoothing on continuous spaces. The PL techniques described are applicable to any other MCO problem, and this particular one is chosen just as an example.
MCO Problem Description

Consider the problem of finding the $\theta$-parameterized distribution $q_\theta$ that minimizes the associated expected value of a function $G: \mathbb{R}^n \to \mathbb{R}$, i.e., find

$$\arg\min_\theta \mathbb{E}_{q_\theta}[G].$$

We are interested in versions of this problem where we do not know the functional form of $G$, but can obtain its value $G(x)$ at any $x \in X$. Similarly, we cannot assume that $G$ is smooth, nor can we evaluate its derivatives directly. This scenario arises in many fields, including blackbox optimization [see Wolpert et al. 2006], and risk minimization [see Ermoliev and Norkin 1998].

We begin by expressing this minimization problem as an MCO problem. Write

$$\mathbb{E}_{q_\theta}[G] = \int_X dx\; q_\theta(x)\, G(x).$$

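Using MCO terminology, $f_\theta(x) = q_\theta(x)\, G(x)$ and $F(\theta) = \mathbb{E}_{q_\theta}[G]$. To apply MCO, we must define a vector-valued random variable $\hat{\mathbf{F}}$ with components indexed by $\theta$, and then use a sample of $\hat{\mathbf{F}}$ to estimate $\arg\min_\theta \mathbb{E}_{q_\theta}[G]$. In particular, to apply naive MCO to estimate $\arg\min_\theta \mathbb{E}_{q_\theta}(G)$, we first i.i.d. sample a density function $h(x)$. By evaluating the associated values of $G(x)$ we get a data set

$$\mathcal{D} \equiv (\mathcal{D}_X, \mathcal{D}_G) = \big(\{x^{(i)} : i = 1, \ldots, m\},\; \{G(x^{(i)}) : i = 1, \ldots, m\}\big).$$

The associated estimates of $F(\theta)$ for each $\theta$ are

$$\hat{F}(\theta) \triangleq \frac{1}{m} \sum_{i=1}^{m} \frac{q_\theta(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})}. \tag{11}$$

The associated naive MCO estimate of $\arg\min_\theta \mathbb{E}_{q_\theta}[G]$ is

$$\hat{\theta}^\star = \arg\min_\theta \hat{F}(\theta).$$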

Suppose $\Theta$ includes all possible density functions over $x$'s. Then the $q_\theta$ minimizing our estimate is a delta function about the $x^{(i)} \in \mathcal{D}_X$ with the lowest associated value of $G(x^{(i)})/h(x^{(i)})$. This is clearly a poor estimate in general; it suffers from 'data-overfitting'. Proceeding as in PL, one way to address this data-overfitting is to use regularization. In particular, we can use the entropic regularizer, given by the negative of the Shannon entropy $S(q_\theta)$. So we now want to find the minimizer of $\mathbb{E}_{q_\theta}[G(x)] - T\,S(q_\theta)$, where $T$ is the regularization parameter. Equivalently, we can minimize $\beta\,\mathbb{E}_{q_\theta}[G(x)] - S(q_\theta)$, where $\beta = 1/T$. This changes the definition of $\hat{F}$ from the function given in Eq. 11 to

$$\hat{F}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{\beta\, q_\theta(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})} - S(q_\theta).$$

Finding the solution to this minimization problem is the focus of the PC approach to blackbox optimization.

Solution Methodology

Unfortunately, it can be difficult to find the $\theta$ globally minimizing this new $\hat{F}$ for an arbitrary $\mathcal{D}$. An alternative is to find a close approximation to that optimal $\theta$. One way to do this is as follows. First, we find the minimizer of

$$\frac{1}{m} \sum_{i=1}^{m} \frac{\beta\, p(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})} - S(p) \tag{12}$$

over the set of all possible distributions $p(x)$ with domain $X$. We then find the $q_\theta$ that has minimal Kullback-Leibler (KL) divergence from this $p$, evaluated over $\mathcal{D}_X$. That serves as our approximation to $\arg\min_\theta \hat{F}(\theta)$, and therefore as our estimate of the $\theta$ that minimizes $\mathbb{E}_{q_\theta}(G)$.

The minimizer $p$ of Eq. 12 can be found in closed form; over $\mathcal{D}_X$ it is the Boltzmann distribution $p^\beta(x^{(i)}) \propto \exp(-\beta\, G(x^{(i)}))$. The KL divergence in $\mathcal{D}_X$ from this Boltzmann distribution to $q_\theta$ is

$$F(\theta) = \mathrm{KL}(p^\beta \,\|\, q_\theta) = \int_X dx\; p^\beta(x) \log\!\left(\frac{p^\beta(x)}{q_\theta(x)}\right).$$

The minimizer of this KL divergence is given by

$$\theta^\dagger = \arg\min_\theta \left[ -\sum_{i=1}^{m} \frac{\exp(-\beta\, G(x^{(i)}))}{h(x^{(i)})} \log\big(q_\theta(x^{(i)})\big) \right]. \tag{13}$$

$\theta^\dagger$ is an approximation to the estimate of the $\theta$ that minimizes $\mathbb{E}_{q_\theta}(G)$ given by the regularized version of naive MCO. Our incorporation of regularization here has the same motivation as it does in PL: to reduce bias plus variance.



Log-concave Densities 



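If $q_\theta$ is log-concave in its parameters $\theta$, then the minimization problem in Eq. 13 is a convex optimization problem, and the optimal parameters can be found in closed form. Denote the likelihood ratios by $s^{(i)} = \exp(-\beta\, G(x^{(i)}))/h(x^{(i)})$. For a Gaussian $q_\theta$ with mean $\mu$ and covariance $\Sigma$, differentiating Eq. 13 with respect to the parameters $\mu$ and $\Sigma^{-1}$ and setting them to zero yields

$$\mu^\star = \frac{\sum_i s^{(i)}\, x^{(i)}}{\sum_i s^{(i)}}, \qquad
\Sigma^\star = \frac{\sum_i s^{(i)}\, (x^{(i)} - \mu^\star)(x^{(i)} - \mu^\star)^T}{\sum_i s^{(i)}}.$$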
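For a one-dimensional Gaussian, the weighted-moment solution of Eq. 13 can be sketched in a few lines. This is an illustrative sketch only: the objective $G(x) = (x - 2)^2$, the uniform sampling density on $[-5, 5]$, the value of $\beta$, and the function name `boltzmann_gaussian_fit` are all assumptions introduced here. Subtracting the smallest observed $G$ before exponentiating rescales every weight by the same factor, so it does not change the result; it is only a numerical safeguard.

```python
import math
import random


def boltzmann_gaussian_fit(xs, gs, hs, beta):
    """Weighted mean and variance with weights s_i = exp(-beta * G_i) / h_i."""
    g_min = min(gs)  # common shift: rescales all weights equally, avoids underflow
    s = [math.exp(-beta * (g - g_min)) / h for g, h in zip(gs, hs)]
    total = sum(s)
    mu = sum(si * x for si, x in zip(s, xs)) / total
    var = sum(si * (x - mu) ** 2 for si, x in zip(s, xs)) / total
    return mu, var


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.uniform(-5.0, 5.0) for _ in range(5000)]  # samples of h
    gs = [(x - 2.0) ** 2 for x in xs]                   # assumed G(x)
    hs = [0.1] * len(xs)                                # uniform density on [-5, 5]
    mu, var = boltzmann_gaussian_fit(xs, gs, hs, beta=5.0)
    print("fitted mean:", mu, "fitted variance:", var)
```

For this assumed quadratic $G$, the Boltzmann distribution is itself Gaussian with mean 2 and variance $1/(2\beta)$, so the fitted parameters should land close to those values.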



Mixture Models 

The single Gaussian is a fairly restrictive class of models. Mixture models can significantly improve flexibility, but at the cost of convexity of the KL distance minimization problem. However, a plethora of techniques from supervised learning, in particular the Expectation Maximization (EM) algorithm, can be applied with minor modifications.

Suppose $q_\theta$ is a mixture of $M$ Gaussians, that is, $\theta = (\mu, \Sigma, \phi)$, where $\phi$ is the mixing p.m.f. We can view the problem as one where a hidden variable $z$ decides which mixture component each sample is drawn from. We then have the optimization problem

$$\text{minimize} \quad -\sum_{i=1}^{m} s^{(i)} \log\big(q_\theta(x^{(i)}, z^{(i)})\big).$$

Following the standard EM procedure, we get the algorithm described in Eq. 14. Since this is a nonconvex problem, one typically runs the algorithm multiple times with random initializations of the parameters.

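$$\begin{aligned}
\textbf{E-step:}\quad & \text{For each } i, \text{ set } Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}), \\
& \text{that is, } w_j^{(i)} = q_{\mu, \Sigma, \phi}(z^{(i)} = j \mid x^{(i)}), \quad j = 1, \ldots, M. \\[4pt]
\textbf{M-step:}\quad & \text{Set } \mu_j = \frac{\sum_i w_j^{(i)}\, s^{(i)}\, x^{(i)}}{\sum_i w_j^{(i)}\, s^{(i)}}, \\
& \Sigma_j = \frac{\sum_i w_j^{(i)}\, s^{(i)}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T}{\sum_i w_j^{(i)}\, s^{(i)}}, \\
& \phi_j = \frac{\sum_i w_j^{(i)}\, s^{(i)}}{\sum_i s^{(i)}}.
\end{aligned} \tag{14}$$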
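A one-dimensional version of such a weighted EM iteration can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes scalar data, per-component variances in place of covariance matrices, a fixed (not random) initialization, and a small variance floor for numerical safety. The helper names and the bimodal test objective are hypothetical.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def weighted_em(xs, s, mus, variances, phis, iters=50, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture where sample i carries weight s[i]."""
    mus, variances, phis = list(mus), list(variances), list(phis)
    n, M = len(xs), len(mus)
    s_total = sum(s)
    for _ in range(iters):
        # E-step: responsibilities w[i][j] = q(z = j | x_i) under current parameters.
        w = []
        for x in xs:
            joint = [phis[j] * normal_pdf(x, mus[j], variances[j]) for j in range(M)]
            norm = sum(joint)
            if norm > 0.0:
                w.append([p / norm for p in joint])
            else:
                w.append([1.0 / M] * M)  # fallback if all densities underflow
        # M-step: moments weighted by responsibility times likelihood ratio.
        for j in range(M):
            ws = [w[i][j] * s[i] for i in range(n)]
            tot = sum(ws)
            if tot == 0.0:
                continue  # leave an empty component unchanged
            mus[j] = sum(a * x for a, x in zip(ws, xs)) / tot
            var_j = sum(a * (x - mus[j]) ** 2 for a, x in zip(ws, xs)) / tot
            variances[j] = max(var_j, var_floor)
            phis[j] = tot / s_total
    return mus, variances, phis


if __name__ == "__main__":
    # Assumed setup: grid samples of a uniform h on [-6, 6], bimodal G, beta = 2.
    xs = [-6.0 + 0.05 * i for i in range(241)]
    beta, h = 2.0, 1.0 / 12.0
    s = [math.exp(-beta * min((x + 3.0) ** 2, (x - 3.0) ** 2)) / h for x in xs]
    print(weighted_em(xs, s, mus=[-1.0, 1.0], variances=[1.0, 1.0], phis=[0.5, 0.5]))
```

In this assumed example the weights concentrate around $x = -3$ and $x = 3$, and the two components should settle near those modes with roughly equal mixing weights.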
Test Problems

To compare the performance of this algorithm with and without the use of PL techniques, we use a couple of very simple academic problems in two and four dimensions: the Rosenbrock function in two dimensions, given by

$$G_R(x) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2,$$

and the Woods function in four dimensions, given by

$$\begin{aligned}
G_W(x) = {} & 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2 + 90\,(x_4 - x_3^2)^2 + (1 - x_3)^2 \\
& + 10.1\,[(1 - x_2)^2 + (1 - x_4)^2] + 19.8\,(1 - x_2)(1 - x_4).
\end{aligned}$$

For the Rosenbrock, the optimum value of 0 is achieved at $x = (1, 1)$, and for the Woods problem, the optimum value of 0 is achieved at $x = (1, 1, 1, 1)$.
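For reference, the two test objectives can be written directly as code. The sketch below assumes the standard textbook forms of the two-dimensional Rosenbrock function and the four-dimensional Wood (Woods) function; the function names are illustrative.

```python
def rosenbrock(x):
    """Two-dimensional Rosenbrock function; minimum value 0 at (1, 1)."""
    x1, x2 = x
    return 100.0 * (x2 - x1 ** 2) ** 2 + (1.0 - x1) ** 2


def woods(x):
    """Four-dimensional Wood function (standard form); minimum value 0 at (1, 1, 1, 1)."""
    x1, x2, x3, x4 = x
    return (
        100.0 * (x2 - x1 ** 2) ** 2
        + (1.0 - x1) ** 2
        + 90.0 * (x4 - x3 ** 2) ** 2
        + (1.0 - x3) ** 2
        + 10.1 * ((1.0 - x2) ** 2 + (1.0 - x4) ** 2)
        + 19.8 * (1.0 - x2) * (1.0 - x4)
    )


if __name__ == "__main__":
    print(rosenbrock((1.0, 1.0)), woods((1.0, 1.0, 1.0, 1.0)))
```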

Application of PL Techniques 

As mentioned above, there are many PL techniques beyond regularization that are designed to optimize the trade-off between bias and variance. So having cast the solution of $\arg\min_{q_\theta} \mathbb{E}(G)$ as an MCO problem, we can apply those other PL techniques instead of (or in addition to) entropic regularization. This should improve the performance of our MCO algorithm, for the exact same reason that using those techniques to trade off bias and variance improves performance in PL. We briefly mention some of those alternative techniques here.

The overall MCO algorithm is broadly described in Alg. 1. For the Woods problem, 20 samples of $x$ are drawn from the updated $q_\theta$ at each iteration, and for the Rosenbrock, 10 samples. For comparing various methods and plotting purposes, 1000 samples of $G(x)$ are drawn to evaluate $\mathbb{E}_{q_\theta}[G(x)]$. Note: in an actual optimization, we will not be drawing these test samples! All the performance results in Fig. 1 are based on 50 runs of the PC algorithm, randomly initialized each time. The sample mean performance across these runs is plotted along with 95% confidence intervals for this sample mean (shaded regions).



Algorithm 1 Overview of pq minimization using Gaussian mixtures
 1: Draw uniform random samples on X
 2: Initialize regularization parameter β
 3: Compute G(x) values for those samples
 4: repeat
 5:     Find a mixture distribution q_θ to minimize sampled pq KL distance
 6:     Sample from q_θ
 7:     Compute G(x) for those samples
 8:     Update β
 9: until Termination
10: Sample final q_θ to get solution(s).
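The overall loop can be sketched in one dimension. The version below is a deliberately simplified illustration and departs from Alg. 1 in several assumed ways: it uses a single Gaussian rather than a mixture, starts from a broad Gaussian instead of uniform samples, reuses only the samples drawn in the current iteration (so the sampling density $h$ is the current $q_\theta$), and updates $\beta$ with a fixed geometric factor. The function names and all parameter values are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def pc_minimize(G, mu, var, beta, beta_growth, samples_per_iter, iters,
                seed=0, var_floor=1e-8):
    """Simplified 1-D sketch of the sample / fit / update-beta loop."""
    rng = random.Random(seed)
    for _ in range(iters):
        sd = math.sqrt(var)
        xs = [rng.gauss(mu, sd) for _ in range(samples_per_iter)]  # sample from q
        gs = [G(x) for x in xs]                                    # evaluate G
        hs = [normal_pdf(x, mu, var) for x in xs]                  # sampling density
        g_min = min(gs)  # common shift of G; leaves normalized weights unchanged
        s = [math.exp(-beta * (g - g_min)) / h for g, h in zip(gs, hs)]
        total = sum(s)
        # Weighted Gaussian fit (closed-form minimizer of the sampled KL distance).
        mu = sum(si * x for si, x in zip(s, xs)) / total
        var = max(sum(si * (x - mu) ** 2 for si, x in zip(s, xs)) / total, var_floor)
        beta *= beta_growth  # assumed geometric schedule
    return mu, var


if __name__ == "__main__":
    mu, var = pc_minimize(lambda x: (x - 2.0) ** 2, mu=0.0, var=9.0, beta=0.5,
                          beta_growth=1.5, samples_per_iter=50, iters=15)
    print("final mean:", mu, "final variance:", var)
```

The fixed geometric schedule is used here only to keep the sketch short; the cross-validation discussion that follows in the text concerns choosing $\beta$ adaptively instead.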



Cross-validation for Regularization: We note that we are using regularization to reduce variance, but that regularization introduces bias. As is done in PL, we use standard $k$-fold cross-validation to trade off this bias and variance. We do this by partitioning the data into $k$ disjoint sets. The held-out data for the $i$-th fold is just the $i$-th partition, and the held-in data is the union of all other partitions. First, we 'train' the regularized algorithm on the held-in data $\mathcal{D}_t$ to get an optimal set of parameters $\theta^\star$, then 'test' this $\theta^\star$ by considering unregularized performance on the held-out data $\mathcal{D}_v$. In our context, 'training' refers to finding optimal parameters by KL distance minimization using the held-in data, and 'testing' refers to estimating $\mathbb{E}_{q_\theta}[G(x)]$ on the held-out data using the following formula [Robert and Casella 2004]:

$$\hat{g}(\theta) = \frac{\displaystyle\sum_{\mathcal{D}_v} \frac{q_\theta(x^{(i)})\, G(x^{(i)})}{h(x^{(i)})}}{\displaystyle\sum_{\mathcal{D}_v} \frac{q_\theta(x^{(i)})}{h(x^{(i)})}}.$$

We do this for several values of the regularization parameter $\beta$ in the interval $k_1\beta < \beta < k_2\beta$, and choose the one that yields the best held-out performance, averaged over all folds. For our experiments, $k_1 = 0.5$, $k_2 = 3$, and we use 5 equally-spaced values in this interval. Having found the best regularization parameter in this range, we then use all the data to minimize KL distance using this optimal value of $\beta$. Note that all cross-validation is done without any additional evaluations of $G(x)$. Cross-validation for $\beta$ in PC is similar to optimizing the annealing schedule in simulated annealing. This 'auto-annealing' is seen in Fig. 1a, which shows the variation of $\beta$ with iterations of the Rosenbrock problem. It can be seen that the value of $\beta$ sometimes decreases from one iteration to the next. This can never happen in any kind of 'geometric annealing schedule', $\beta \leftarrow k_\beta\, \beta$, $k_\beta > 1$, of the sort that is often used in most algorithms in the literature. In fact, we ran 50 trials of this algorithm on the Rosenbrock and then computed a best-fit geometric variation for $\beta$, that is, a nonlinear least squares fit to the variation of $\beta$, and a linear least squares fit to the variation of $\log(\beta)$. These are shown in Figs. 1c and 1d. As can be seen, neither is a very good fit. We then ran 50 trials of the algorithm with the fixed update rule obtained by best-fit to $\log(\beta)$, and found that the adaptive setting of $\beta$ using cross-validation performed an order of magnitude better, as shown in Fig. 1e.
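The cross-validation procedure for $\beta$ can be sketched as follows, again in one dimension with a single Gaussian $q_\theta$. Everything concrete here is an assumption for illustration: the interleaved fold assignment, the self-normalized held-out estimate, the test objective, the grid data, and the helper names. No additional evaluations of $G$ are made beyond those already in the data.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gaussian(xs, gs, hs, beta, var_floor=1e-6):
    """'Training': weighted Gaussian fit with weights exp(-beta * G) / h."""
    g_min = min(gs)
    s = [math.exp(-beta * (g - g_min)) / h for g, h in zip(gs, hs)]
    total = sum(s)
    mu = sum(si * x for si, x in zip(s, xs)) / total
    var = sum(si * (x - mu) ** 2 for si, x in zip(s, xs)) / total
    return mu, max(var, var_floor)


def held_out_score(mu, var, xs, gs, hs):
    """'Testing': self-normalized importance estimate of E_q[G] on held-out data."""
    ratios = [normal_pdf(x, mu, var) / h for x, h in zip(xs, hs)]
    denom = sum(ratios)
    if denom == 0.0:
        return math.inf
    return sum(r * g for r, g in zip(ratios, gs)) / denom


def choose_beta(xs, gs, hs, betas, k):
    """Return the candidate beta with the lowest fold-averaged held-out score."""
    n = len(xs)
    folds = [list(range(j, n, k)) for j in range(k)]  # interleaved partition
    best_beta, best_score = None, math.inf
    for beta in betas:
        total = 0.0
        for fold in folds:
            held_out = set(fold)
            held_in = [i for i in range(n) if i not in held_out]
            mu, var = fit_gaussian([xs[i] for i in held_in], [gs[i] for i in held_in],
                                   [hs[i] for i in held_in], beta)
            total += held_out_score(mu, var, [xs[i] for i in fold],
                                    [gs[i] for i in fold], [hs[i] for i in fold])
        score = total / k
        if score < best_score:
            best_beta, best_score = beta, score
    return best_beta, best_score


if __name__ == "__main__":
    xs = [-5.0 + 10.0 * (i + 0.5) / 40 for i in range(40)]  # assumed samples of h
    gs = [(x - 2.0) ** 2 for x in xs]                        # assumed G(x)
    hs = [0.1] * len(xs)                                     # uniform density on [-5, 5]
    current_beta = 1.0
    betas = [current_beta * (0.5 + 2.5 * j / 4) for j in range(5)]  # 5 values in [0.5, 3]
    print(choose_beta(xs, gs, hs, betas, k=5))
```

After the best candidate is selected, the text's procedure refits on all the data with that value of $\beta$; in this sketch that would be one further call to `fit_gaussian` on the full data set.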

Cross-validation for Model Selection: Given a set $\Theta$ (sometimes called a model class) to choose $\theta$ from, we can find an optimal $\theta \in \Theta$. But how do we choose the set $\Theta$? In PL, this is done using cross-validation. We choose that set $\Theta$ such that $\arg\min_{\theta \in \Theta} \hat{F}(\theta)$ has the best held-out performance. As before, we use that model class $\Theta$ that yields the lowest estimate of $\mathbb{E}_{q_\theta}[G(x)]$ on the held-out data. We demonstrate the use of this PL technique for minimizing the Rosenbrock problem, which has a long curved valley that is poorly approximated by a single Gaussian. We use cross-validation to choose among Gaussian mixtures with up to 4 components. The improvement in performance is shown in Fig. 1d.



Bagging: In bagging [Breiman 1996a], we generate multiple data sets by resampling the given data set with replacement. These new data sets will, in general, contain replicates. We 'train' the learning algorithm on each of these resampled data sets, and average the results. In our case, we average the $q_\theta$ obtained by our KL divergence minimization on each data set. PC works even on stochastic objective functions, and on the noisy Rosenbrock, we implemented PC with bagging by resampling 10 times, and obtained significant performance gains, as seen in Fig. 1g.
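A minimal sketch of bagging in this setting is given below, under assumptions introduced for illustration: one-dimensional data, a single Gaussian fitted to each bootstrap resample by weighted moments, and the average of the fitted densities represented as an equal-weight mixture. The names `bagged_fit` and `bagged_pdf`, the objective, and the data are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gaussian(xs, gs, hs, beta, var_floor=1e-6):
    """Weighted Gaussian fit with weights exp(-beta * G) / h."""
    g_min = min(gs)
    s = [math.exp(-beta * (g - g_min)) / h for g, h in zip(gs, hs)]
    total = sum(s)
    mu = sum(si * x for si, x in zip(s, xs)) / total
    var = sum(si * (x - mu) ** 2 for si, x in zip(s, xs)) / total
    return mu, max(var, var_floor)


def bagged_fit(xs, gs, hs, beta, n_bags, rng):
    """Fit one Gaussian per bootstrap resample (sampling with replacement)."""
    n = len(xs)
    components = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]
        components.append(fit_gaussian([xs[i] for i in idx], [gs[i] for i in idx],
                                       [hs[i] for i in idx], beta))
    return components


def bagged_pdf(x, components):
    """Average of the fitted densities: an equal-weight mixture."""
    return sum(normal_pdf(x, mu, var) for mu, var in components) / len(components)


if __name__ == "__main__":
    xs = [-5.0 + 10.0 * (i + 0.5) / 40 for i in range(40)]  # assumed samples of h
    gs = [(x - 2.0) ** 2 for x in xs]                        # assumed G(x)
    hs = [0.1] * len(xs)
    comps = bagged_fit(xs, gs, hs, beta=2.0, n_bags=10, rng=random.Random(0))
    print("component means:", [round(mu, 3) for mu, _ in comps])
    print("bagged density at x=2:", bagged_pdf(2.0, comps))
```

Averaging densities rather than parameters is one reasonable reading of "average the $q_\theta$"; it keeps the result a valid distribution even when the individual fits differ.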

Stacking: In bagging, we combine estimates of the same learning algorithm on different data sets generated by resampling, whereas in stacking [Breiman 1996b, Smyth and Wolpert 1999], we combine estimates of different learning algorithms on the same data set. These combined estimates are often better than any of the single estimates. In our case, we combine the $q_\theta$ obtained from our KL divergence minimization algorithm using multiple models $\Theta$. Again, Fig. 1h shows that cross-validation for model selection performs better than a single model, and stacking performs slightly better than cross-validation.



Conclusions 

The conventional goal of reducing bias plus variance has interesting applications in a variety of fields. In straightforward applications, the bias-variance trade-offs can decrease the MSE of estimators, reduce the generalization error of learning algorithms, and so on. In this article, we described a novel application of bias-variance trade-offs: we placed bias-variance trade-offs in the context of Monte Carlo Optimization, and discussed the need for higher moments in the trade-off, such as a bias-variance-covariance trade-off. We also showed a way of applying just a bias-variance trade-off, as used in Parametric Learning, to improve the performance of Monte Carlo Optimization algorithms.